{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <center>RegEx in Python</center>\n",
    "<img src=\"images/memes/meme15.png\" height=500 width=700>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Boundary Matchers\n",
    "\n",
    "Consider a scenario where you want to find all occurances of `and`, `or` and `the` in the given text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from utils import highlight_regex_matches"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "txt = \"\"\"\n",
    "Lorem Ipsum is simply dummy text of the printing and typesetting industry. \n",
    "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, \n",
    "when an unknown printer took a galley of type and scrambled it to make a type specimen book. \n",
    "It has survived not only five centuries, but also the leap into electronic typesetting, \n",
    "remaining essentially unchanged. \n",
    "It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, \n",
    "and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(\"and|or|the\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['or',\n",
       " 'the',\n",
       " 'and',\n",
       " 'or',\n",
       " 'the',\n",
       " 'and',\n",
       " 'the',\n",
       " 'and',\n",
       " 'the',\n",
       " 'the',\n",
       " 'the',\n",
       " 'or',\n",
       " 'and',\n",
       " 'or',\n",
       " 'or']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern.findall(txt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "L\u001b[43m\u001b[1mor\u001b[0mem Ipsum is simply dummy text of \u001b[43m\u001b[1mthe\u001b[0m printing \u001b[43m\u001b[1mand\u001b[0m typesetting industry. \n",
      "L\u001b[43m\u001b[1mor\u001b[0mem Ipsum has been \u001b[43m\u001b[1mthe\u001b[0m industry's st\u001b[43m\u001b[1mand\u001b[0mard dummy text ever since \u001b[43m\u001b[1mthe\u001b[0m 1500s, \n",
      "when an unknown printer took a galley of type \u001b[43m\u001b[1mand\u001b[0m scrambled it to make a type specimen book. \n",
      "It has survived not only five centuries, but also \u001b[43m\u001b[1mthe\u001b[0m leap into electronic typesetting, \n",
      "remaining essentially unchanged. \n",
      "It was popularised in \u001b[43m\u001b[1mthe\u001b[0m 1960s with \u001b[43m\u001b[1mthe\u001b[0m release of Letraset sheets containing L\u001b[43m\u001b[1mor\u001b[0mem Ipsum passages, \n",
      "\u001b[43m\u001b[1mand\u001b[0m m\u001b[43m\u001b[1mor\u001b[0me recently with desktop publishing software like Aldus PageMaker including versions of L\u001b[43m\u001b[1mor\u001b[0mem Ipsum.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "highlight_regex_matches(pattern, txt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is a slight problem with the above pattern. `and`, `or`, `the` inside the words are also counted as a match where as we want to find individual strings containing `and`, `or`, `the` only.\n",
    "\n",
    "### What is the solution?\n",
    "\n",
    "Solution is to use this pattern:\n",
    "\n",
    "`\\b(and|or|the)\\b`\n",
    "\n",
    "where `\\b` is a metacharacter that matches at a position that is called a **word boundary**. \n",
    "\n",
    "Such identifiers that correspond to a particular position inside of the input are called **Boundary Matchers**.\n",
    "\n",
    "**Note:** Since `\\b` is also an escape sequence for strings in Python, we need to escape it using `\\`, i.e. `\\\\b`,  in order to treat it like a metacharacter for regex matching."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(\"\\\\b(and|or|the)\\\\b\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Lorem Ipsum is simply dummy text of \u001b[43m\u001b[1mthe\u001b[0m printing \u001b[43m\u001b[1mand\u001b[0m typesetting industry. \n",
      "Lorem Ipsum has been \u001b[43m\u001b[1mthe\u001b[0m industry's standard dummy text ever since \u001b[43m\u001b[1mthe\u001b[0m 1500s, \n",
      "when an unknown printer took a galley of type \u001b[43m\u001b[1mand\u001b[0m scrambled it to make a type specimen book. \n",
      "It has survived not only five centuries, but also \u001b[43m\u001b[1mthe\u001b[0m leap into electronic typesetting, \n",
      "remaining essentially unchanged. \n",
      "It was popularised in \u001b[43m\u001b[1mthe\u001b[0m 1960s with \u001b[43m\u001b[1mthe\u001b[0m release of Letraset sheets containing Lorem Ipsum passages, \n",
      "\u001b[43m\u001b[1mand\u001b[0m more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "highlight_regex_matches(pattern, txt)"
   ]
  },
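  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check of the note on escaping, the sketch below compares three spellings of the same pattern. The names `sample`, `escaped`, `raw` and `unescaped` are introduced here only for illustration, and the sample text is made up. A raw string (`r\"...\"`) is a common alternative to doubling the backslashes by hand."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Hypothetical sample text, not part of the notebook's data\n",
    "sample = \"the cat and the band played on or near the shore\"\n",
    "\n",
    "escaped = re.compile(\"\\\\b(and|or|the)\\\\b\")   # backslashes doubled by hand\n",
    "raw = re.compile(r\"\\b(and|or|the)\\b\")        # raw string keeps backslashes as written\n",
    "unescaped = re.compile(\"\\b(and|or|the)\\b\")   # here \\b becomes a backspace character\n",
    "\n",
    "# Both spellings yield the same pattern text, so they match the same whole words\n",
    "print(escaped.pattern == raw.pattern)\n",
    "print(raw.findall(sample))\n",
    "\n",
    "# The unescaped version searches for literal backspace characters and finds nothing\n",
    "print(unescaped.findall(sample))"
   ]
  },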
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a table which shows the list of all boundary matchers available in Python:\n",
    "\n",
    "<table style=\"border: 1px solid black; font-size:15px;\">\n",
    "<thead>\n",
    "    <th>Matcher</th>\n",
    "    <th>Description</th>\n",
    "</thead>\n",
    "    \n",
    "<tbody>\n",
    "<tr>\n",
    "    <td>^</td>\n",
    "    <td>Matches at the beginning of a line</td>\n",
    "</tr>\n",
    "    \n",
    "<tr>\n",
    "    <td>$</td>\n",
    "    <td>Matches at the end of a line</td>\n",
    "</tr>\n",
    "\n",
    "<tr>\n",
    "    <td>\\b</td>\n",
    "    <td>Matches a word boundary</td>\n",
    "</tr>\n",
    "\n",
    "<tr>\n",
    "    <td>\\B</td>\n",
    "    <td>Matches the opposite of \\b. Anything that is not a word boundary</td>\n",
    "</tr>\n",
    "\n",
    "<tr>\n",
    "    <td>\\A</td>\n",
    "    <td>Matches the beginning of the input</td>\n",
    "</tr>\n",
    "\n",
    "<tr>\n",
    "    <td>\\Z</td>\n",
    "    <td>Matches the end of the input</td>\n",
    "</tr>\n",
    "</tbody>\n",
    "</table>\n",
    "\n",
    "### Example 1\n",
    "\n",
    "Consider a scenario where we want to find all the lines in the given text which **start** with the pattern `Name:`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "txt = \"\"\"\n",
    "Name:\n",
    "Age: 0\n",
    "Roll No.: 15\n",
    "Grade: S\n",
    "\n",
    "Name: Ravi\n",
    "Age: -1\n",
    "Roll No.: 123 Name: ABC\n",
    "Grade: K\n",
    "\n",
    "Name: Ram\n",
    "Age: N/A\n",
    "Roll No.: 1\n",
    "Grade: G\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(\"^Name: \\w+\", flags=re.M)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Name: Ravi', 'Name: Ram']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern.findall(txt)"
   ]
  },
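  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the `re.M` flag passed to `re.compile` above contributes, the following sketch runs the same pattern with and without it. It uses a shortened, hypothetical copy of the records (`records`, `without_flag` and `with_flag` are illustrative names, not part of the notebook's data)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Hypothetical, shortened records used only for this comparison\n",
    "records = \"Name: Ravi\\nAge: -1\\nRoll No.: 123 Name: ABC\\nName: Ram\\n\"\n",
    "\n",
    "# Without re.M, ^ can match only at the very start of the input\n",
    "without_flag = re.findall(r\"^Name: \\w+\", records)\n",
    "\n",
    "# With re.M, ^ can also match right after every newline\n",
    "with_flag = re.findall(r\"^Name: \\w+\", records, flags=re.M)\n",
    "\n",
    "print(without_flag)\n",
    "print(with_flag)"
   ]
  },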
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
 `re.M` (short for">
    "> `re.M` (short for `re.MULTILINE`) is a flag that makes `^` and `$` match at the beginning and end of each line, instead of only at the beginning and end of the whole input.\n",
    "\n",
    "### Example 2\n",
    "\n",
    "Find all the sentences which do not end with a full stop (`.`) in the given text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "txt = \"\"\"\n",
    "Lorem Ipsum is simply dummy text of the printing and typesetting industry.\n",
    "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s!\n",
    "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.\n",
    "It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages\n",
    "More recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(\"^.+[^\\.]$\", flags=re.M)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[\"Lorem Ipsum has been the industry's standard dummy text ever since the 1500s!\",\n",
       " 'It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern.findall(txt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Lorem Ipsum is simply dummy text of the printing and typesetting industry.\n",
      "\u001b[43m\u001b[1mLorem Ipsum has been the industry's standard dummy text ever since the 1500s!\u001b[0m\n",
      "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.\n",
      "\u001b[43m\u001b[1mIt was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages\u001b[0m\n",
      "More recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.\n"
     ]
    }
   ],
   "source": [
    "highlight_regex_matches(pattern, txt)"
   ]
  },
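  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The examples above exercise `^`, `$` and `\\b`. As a rough sketch of the remaining matchers from the table, the cell below uses a small hypothetical string (`lines`) to contrast `\\A` and `\\Z` with `^` and `$` under `re.M`, and to show `\\B` matching only inside a word. All variable names here are illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Hypothetical three-line input used only for this comparison\n",
    "lines = \"one cat\\ntwo cats\\nthree concat\"\n",
    "\n",
    "# With re.M, ^ matches at the start of every line; \\A still matches only at the start of the input\n",
    "line_starts = re.findall(r\"^\\w+\", lines, flags=re.M)\n",
    "input_start = re.findall(r\"\\A\\w+\", lines, flags=re.M)\n",
    "\n",
    "# Likewise, $ matches at the end of every line; \\Z only at the end of the input\n",
    "line_ends = re.findall(r\"\\w+$\", lines, flags=re.M)\n",
    "input_end = re.findall(r\"\\w+\\Z\", lines, flags=re.M)\n",
    "\n",
    "# \\Bcat needs a word character right before \"cat\"; \\bcat needs \"cat\" to start a word\n",
    "inside_word = re.findall(r\"\\Bcat\", lines)\n",
    "at_word_start = re.findall(r\"\\bcat\", lines)\n",
    "\n",
    "print(line_starts, input_start)\n",
    "print(line_ends, input_end)\n",
    "print(len(inside_word), len(at_word_start))"
   ]
  },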
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/memes/meme16.png)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
